Predictive Modeling Discussions¶
a. Are you working on a REGRESSION or CLASSIFICATION problem?
The given problem can be approached as either a REGRESSION or a CLASSIFICATION problem. I have decided to approach it as a classification problem by categorizing tracks with popularity <= 50 as 0 (Unpopular) and tracks with popularity > 50 as 1 (Popular) in a new column called track_popularity_bin. The goal is to build a classification model that classifies a track as 0 or 1 based on its characteristics.
b. Which variables are inputs?
The following are the final input variables I identified after EDA:
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- instrumentalness
- liveness
- valence
- tempo
- duration_ms
c. Which variables are responses/outputs/outcomes/targets?
track_popularity_bin is the target variable
d. Did you need to DERIVE the responses of interest by SUMMARIZING the available data?
- Yes
e. If so, what summary actions did you perform?
- Grouped the songs with track_popularity <= 50 as 0 and track_popularity > 50 as 1 in a new column called track_popularity_bin
f. Which variables are identifiers and should NOT be used in the models?
- track_id
- track_album_id
- playlist_id
- track_name
- playlist_name
- track_artist
g. Important: Answer this question after completing parts C and D. Return to this predictive modeling discussion section to answer the following:
i. Which of the inputs do you think influence the response, based on your exploratory visualizations? Which exploratory visualization helped you identify potential input-to-output relationships? (If you are not sure which inputs seem to influence the response, it is okay to say so.)
Answer: The following visualizations helped identify potential input-to-output relationships:
- Conditional Distribution of continuous variables GROUPED BY the response (target) variable
- Relationships between continuous variables GROUPED BY the response (target) variable
- Conditional Distribution of continuous variables GROUPED BY the response (target) variable and additional categorical variable
Inputs that influence the response: the continuous variables that represent the characteristics of a track (danceability, energy, loudness, valence, tempo, etc.) influence the response (target) variable
Import Modules¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("colorblind")
Loading the Dataset¶
The following steps load the dataset from the given URL into a pandas dataframe named df
songs_url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
df_main = pd.read_csv(songs_url)
# Creating a copy. Keeping the main df intact in case needed for further analysis
df = df_main.copy()
Basic Info about the dataset¶
First, I find the dimensionality of the pandas dataframe using the df.shape attribute. This tells how many rows and columns are in the dataframe. In this case there are 32833 rows and 23 columns
df.shape
(32833, 23)
Then, I explore the datatypes and the count of non-null values in every column of the dataset using the df.info() method
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudness                  32833 non-null  float64
 15  mode                      32833 non-null  int64  
 16  speechiness               32833 non-null  float64
 17  acousticness              32833 non-null  float64
 18  instrumentalness          32833 non-null  float64
 19  liveness                  32833 non-null  float64
 20  valence                   32833 non-null  float64
 21  tempo                     32833 non-null  float64
 22  duration_ms               32833 non-null  int64  
dtypes: float64(9), int64(4), object(10)
memory usage: 5.8+ MB
Also, below is the description of each of the column in the dataset
| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
Analyzing target/output variable track_popularity¶
The histogram of track_popularity shows that it is roughly normally distributed, but the value 0 has far more entries than other values. We can further confirm that using the boxplot below
Analysis¶
df.track_popularity.describe()
count    32833.000000
mean        42.477081
std         24.984074
min          0.000000
25%         24.000000
50%         45.000000
75%         62.000000
max        100.000000
Name: track_popularity, dtype: float64
df.track_id.nunique()
28356
sns.boxplot(x=df["track_popularity"], showmeans=True, width=0.2)
<Axes: xlabel='track_popularity'>
sns.displot(data = df, x='track_popularity', binwidth=5, aspect=1.25)
plt.show()
# percentage of tracks with `track_popularity` = 0
print('percentage of tracks with `track_popularity` as 0 = ', np.mean( df.track_popularity == 0 ) * 100, '%')
# percentage of tracks with `track_popularity` = 100
print('percentage of tracks with `track_popularity` as 100 = ', np.mean( df.track_popularity == 100 ) * 100, '%')
percentage of tracks with `track_popularity` as 0 = 8.23257088904456 %
percentage of tracks with `track_popularity` as 100 = 0.0060914324003289375 %
⭐ The above plots reveal that although track_popularity is an integer column, it is not a truly continuous output: the values are bounded between 0 and 100 and heavily concentrated at 0. Linear regression works best when the output is continuous; since that is not the case, this is better approached as a classification problem.
✨ To do that, I will create a new column called track_popularity_bin. Tracks with track_popularity > 50 will be considered 1 (popular) and the ones <= 50 will be considered 0 (unpopular)
df['track_popularity_bin'] = np.where( df.track_popularity > 50, 1, 0 )
df = df.astype({'track_popularity_bin': 'object'})
df.track_popularity_bin.value_counts(normalize=True)
track_popularity_bin
0    0.574757
1    0.425243
Name: proportion, dtype: float64
💡 Although not perfectly balanced, the binary outcome is not overly imbalanced and so conventional classification approaches can be applied.
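One quick sanity check worth keeping in mind (a sketch on toy labels drawn with roughly the same 57/43 split, not the real data): the accuracy of always predicting the majority class is the baseline any classifier must beat.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy binary labels with approximately the 57/43 split observed above
y = pd.Series(rng.choice([0, 1], size=10_000, p=[0.575, 0.425]))

# A classifier that always predicts the majority class scores this accuracy
majority_baseline = y.value_counts(normalize=True).max()
print(round(majority_baseline, 3))
```

So any model built later should be judged against an accuracy of roughly 0.57, not 0.5.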
Handle Duplicates¶
The dataframe has 32833 rows, but the above analysis shows that track_id is not unique per row: there are only 28356 unique track_id values, not 32833
Then, I check whether the characteristics of the duplicated tracks vary across their entries
track_characteristics=['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
for tc in track_characteristics:
print(f'==={tc}===')
print(df.groupby(['track_id']).\
aggregate(num_track_pop_values = ('track_popularity', 'nunique'),
num_charc_values = (tc, 'nunique')).\
reset_index().\
nunique())
===danceability===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===energy===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===key===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===loudness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===mode===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===speechiness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===acousticness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===instrumentalness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===liveness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===valence===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===tempo===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
💡 The above analysis reveals that num_track_pop_values and num_charc_values take one and only one value (1) for every characteristic. Thus, each unique track has a single track_popularity value and a single value for each characteristic.
📌 Based on this info, we can remove the duplicate tracks by retaining only the first occurrence of each track_id
df.drop_duplicates(subset=['track_id'], keep='first', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  28356 non-null  object 
 1   track_name                28352 non-null  object 
 2   track_artist              28352 non-null  object 
 3   track_popularity          28356 non-null  int64  
 4   track_album_id            28356 non-null  object 
 5   track_album_name          28352 non-null  object 
 6   track_album_release_date  28356 non-null  object 
 7   playlist_name             28356 non-null  object 
 8   playlist_id               28356 non-null  object 
 9   playlist_genre            28356 non-null  object 
 10  playlist_subgenre         28356 non-null  object 
 11  danceability              28356 non-null  float64
 12  energy                    28356 non-null  float64
 13  key                       28356 non-null  int64  
 14  loudness                  28356 non-null  float64
 15  mode                      28356 non-null  int64  
 16  speechiness               28356 non-null  float64
 17  acousticness              28356 non-null  float64
 18  instrumentalness          28356 non-null  float64
 19  liveness                  28356 non-null  float64
 20  valence                   28356 non-null  float64
 21  tempo                     28356 non-null  float64
 22  duration_ms               28356 non-null  int64  
 23  track_popularity_bin      28356 non-null  object 
dtypes: float64(9), int64(4), object(11)
memory usage: 5.4+ MB
Exploratory Data Analysis¶
Performing essential EDA using pandas methods¶
- Missing Values
- Unique Values
First, I look at the number of missing values in each column. This is important so that I can eliminate columns with too many missing values, as those columns won't provide much insight about the dataset
df.isna().sum()
track_id                    0
track_name                  4
track_artist                4
track_popularity            0
track_album_id              0
track_album_name            4
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
track_popularity_bin        0
dtype: int64
The above analysis shows that only three columns (track_name, track_artist, track_album_name) have missing values, and only in a very small number of rows. So far, I have not dropped any columns based on missing values.
It is now time to look at the number of unique values in each column. Columns with too many unique values may not be informative or may lead to overfitting, while columns with too few unique values may not provide enough discriminative power.
df.nunique(dropna=False)
track_id                    28356
track_name                  23450
track_artist                10693
track_popularity              101
track_album_id              22545
track_album_name            19744
track_album_release_date     4530
playlist_name                 448
playlist_id                   470
playlist_genre                  6
playlist_subgenre              24
danceability                  822
energy                        952
key                            12
loudness                    10222
mode                            2
speechiness                  1270
acousticness                 3731
instrumentalness             4729
liveness                     1624
valence                      1362
tempo                       17684
duration_ms                 19785
track_popularity_bin            2
dtype: int64
📌 The above analysis reveals that although key and mode are numeric columns, they have only a few unique values. So, we can treat these columns as categorical for analysis purposes
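One way to make that decision explicit in pandas is to convert the columns to the category dtype (a sketch on a toy frame whose column names mirror the dataset; the notebook itself keeps them numeric and simply plots them with catplot):

```python
import pandas as pd

# Toy frame mimicking the low-cardinality numeric columns key and mode
toy = pd.DataFrame({'key': [0, 5, 11, 5], 'mode': [1, 0, 1, 1]})

# The category dtype makes grouping and plotting treat the values as
# discrete levels rather than points on a continuous scale
toy = toy.astype({'key': 'category', 'mode': 'category'})
print(toy.dtypes)
```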
Analyzing and Visualizing Categorical Variables¶
Based on the dataframe info, we can determine that the following columns are categorical variables:
- track_id
- track_name
- track_artist
- track_album_id
- track_album_name
- track_album_release_date
- playlist_name
- playlist_id
- playlist_genre
- playlist_subgenre
df.describe(include='object')
| track_id | track_name | track_artist | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | track_popularity_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28356 | 28352 | 28352 | 28356 | 28352 | 28356 | 28356 | 28356 | 28356 | 28356 | 28356 |
| unique | 28356 | 23449 | 10692 | 22545 | 19743 | 4530 | 448 | 470 | 6 | 24 | 2 |
| top | 6f807x0ima9a1j3VPbc7VN | Breathe | Queen | 5L1xcowSxwzFUSJzvyMp48 | Greatest Hits | 2020-01-10 | Indie Poptimism | 72r6odw0Q3OWTCYMGA7Yiy | rap | southern hip hop | 0 |
| freq | 1 | 18 | 130 | 42 | 135 | 201 | 294 | 100 | 5401 | 1583 | 17850 |
❌ There are far too many unique values in the columns track_artist, playlist_name, track_album_name, and track_name. This makes these columns not very useful for training a model, and it is not practical to visualize them. Note: Regarding visualization, I confirmed with the instructor on the Coursera Discussion Forum that it is not necessary to show visualizations for categorical variables with far too many unique values.
❌ Identifier columns like track_id, track_album_id and playlist_id will also not be very useful
df.drop(['track_id','track_album_id','playlist_id','track_artist','playlist_name', 'track_name', 'track_album_name'],
inplace=True,
axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_popularity          28356 non-null  int64  
 1   track_album_release_date  28356 non-null  object 
 2   playlist_genre            28356 non-null  object 
 3   playlist_subgenre         28356 non-null  object 
 4   danceability              28356 non-null  float64
 5   energy                    28356 non-null  float64
 6   key                       28356 non-null  int64  
 7   loudness                  28356 non-null  float64
 8   mode                      28356 non-null  int64  
 9   speechiness               28356 non-null  float64
 10  acousticness              28356 non-null  float64
 11  instrumentalness          28356 non-null  float64
 12  liveness                  28356 non-null  float64
 13  valence                   28356 non-null  float64
 14  tempo                     28356 non-null  float64
 15  duration_ms               28356 non-null  int64  
 16  track_popularity_bin      28356 non-null  object 
dtypes: float64(9), int64(4), object(4)
memory usage: 3.9+ MB
Create new columns¶
The given data presents an opportunity to create additional columns, which may have an impact on the target variable.
In this case, I used the column track_album_release_date to create two new columns, release_year and release_month
df['track_album_release_date'] = pd.to_datetime(df['track_album_release_date'], format='mixed')
df['release_year'] = df.track_album_release_date.dt.year
sns.catplot(data = df, y='release_year', kind='count', height=10, aspect=2)
plt.show()
💡 As seen above, the dataset is mostly made up of songs released in recent years. This suggests we can create one more variable that represents release_year in two buckets: songs released in 2010 or later in one bucket ('recent') and songs released before 2010 in another ('older')
df['release_year_bin'] = np.where( df.release_year < 2010 , 'older', 'recent')
df.release_year_bin.value_counts()
release_year_bin
recent    20460
older      7896
Name: count, dtype: int64
Next, I created the release_month column
df['release_month'] = df.track_album_release_date.dt.month
sns.catplot(data = df, y='release_month', kind='count', height=8, aspect=1.5)
plt.show()
💡 The majority of songs in the given dataset were released in the month of January
Since track_album_release_date has been split into release_year and release_month, the original column can be dropped
df.drop(['track_album_release_date'], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   track_popularity      28356 non-null  int64  
 1   playlist_genre        28356 non-null  object 
 2   playlist_subgenre     28356 non-null  object 
 3   danceability          28356 non-null  float64
 4   energy                28356 non-null  float64
 5   key                   28356 non-null  int64  
 6   loudness              28356 non-null  float64
 7   mode                  28356 non-null  int64  
 8   speechiness           28356 non-null  float64
 9   acousticness          28356 non-null  float64
 10  instrumentalness      28356 non-null  float64
 11  liveness              28356 non-null  float64
 12  valence               28356 non-null  float64
 13  tempo                 28356 non-null  float64
 14  duration_ms           28356 non-null  int64  
 15  track_popularity_bin  28356 non-null  object 
 16  release_year          28356 non-null  int32  
 17  release_year_bin      28356 non-null  object 
 18  release_month         28356 non-null  int32  
dtypes: float64(9), int32(2), int64(4), object(4)
memory usage: 4.1+ MB
💡 Let's visualize additional variables - key and mode
Although key and mode are numeric columns, they have only a few unique values, so they can be treated as categorical variables
sns.catplot(data = df, y='key', kind='count', height=5, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x7fda0b7e1510>
sns.catplot(data = df, y='mode', kind='count', height=2, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x7fd9b6c68df0>
Analyzing and Visualizing Continuous Variables¶
df.describe()
| track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | release_year | release_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.000000 | 28356.00000 | 28356.000000 | 28356.000000 | 28356.000000 |
| mean | 39.329771 | 0.653372 | 0.698388 | 5.368000 | -6.817696 | 0.565489 | 0.107954 | 0.177176 | 0.091117 | 0.190958 | 0.510387 | 120.95618 | 226575.967026 | 2011.054027 | 6.101813 |
| std | 23.702376 | 0.145785 | 0.183503 | 3.613904 | 3.036243 | 0.495701 | 0.102556 | 0.222803 | 0.232548 | 0.155894 | 0.234340 | 26.95456 | 61078.450819 | 11.229221 | 3.841027 |
| min | 0.000000 | 0.000000 | 0.000175 | 0.000000 | -46.448000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 4000.000000 | 1957.000000 | 1.000000 |
| 25% | 21.000000 | 0.561000 | 0.579000 | 2.000000 | -8.309250 | 0.000000 | 0.041000 | 0.014375 | 0.000000 | 0.092600 | 0.329000 | 99.97200 | 187742.000000 | 2008.000000 | 2.000000 |
| 50% | 42.000000 | 0.670000 | 0.722000 | 6.000000 | -6.261000 | 1.000000 | 0.062600 | 0.079700 | 0.000021 | 0.127000 | 0.512000 | 121.99300 | 216933.000000 | 2016.000000 | 6.000000 |
| 75% | 58.000000 | 0.760000 | 0.843000 | 9.000000 | -4.709000 | 1.000000 | 0.133000 | 0.260000 | 0.006570 | 0.249000 | 0.695000 | 133.99900 | 254975.250000 | 2019.000000 | 10.000000 |
| max | 100.000000 | 0.983000 | 1.000000 | 11.000000 | 1.275000 | 1.000000 | 0.918000 | 0.994000 | 0.994000 | 0.996000 | 0.991000 | 239.44000 | 517810.000000 | 2020.000000 | 12.000000 |
The above table shows basic statistics for the continuous variables.
For analyzing the continuous variables, I start by creating a new data frame df_lf that holds the data in LONG FORMAT. This makes the visualizations easier to build with Seaborn
df_features = df.select_dtypes('number').copy()
df_features.drop(['track_popularity'], axis=1, inplace=True) # Dropping Target variable
df_objects = df.select_dtypes('object').copy()
id_cols = ['rowid', 'track_popularity'] + df_objects.columns.to_list()
df_lf = df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
df_lf
| rowid | track_popularity | playlist_genre | playlist_subgenre | track_popularity_bin | release_year_bin | variable | value | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 66 | pop | dance pop | 1 | recent | danceability | 0.748 |
| 1 | 1 | 67 | pop | dance pop | 1 | recent | danceability | 0.726 |
| 2 | 2 | 70 | pop | dance pop | 1 | recent | danceability | 0.675 |
| 3 | 3 | 60 | pop | dance pop | 1 | recent | danceability | 0.718 |
| 4 | 4 | 69 | pop | dance pop | 1 | recent | danceability | 0.650 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396979 | 32828 | 42 | edm | progressive electro house | 0 | recent | release_month | 4.000 |
| 396980 | 32829 | 20 | edm | progressive electro house | 0 | recent | release_month | 3.000 |
| 396981 | 32830 | 14 | edm | progressive electro house | 0 | recent | release_month | 4.000 |
| 396982 | 32831 | 15 | edm | progressive electro house | 0 | recent | release_month | 1.000 |
| 396983 | 32832 | 27 | edm | progressive electro house | 0 | recent | release_month | 3.000 |
396984 rows × 8 columns
💡 To visualize the continuous variables we will use Histograms and KDE plots
sns.displot(data = df_lf, x='value', col='variable', kind='hist', kde=True,
facet_kws={'sharex': False, 'sharey': False},
common_bins=False,
col_wrap=3)
plt.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.show()
💡 Observation¶
We can see that speechiness, instrumentalness, liveness, and acousticness are right-skewed.
loudness and mode are left-skewed.
Only danceability, energy, valence, and tempo have approximately normal distributions.
📌 Before we can use the data for modeling, we should transform the left- and right-skewed features to have more symmetrical, bell-shaped distributions.
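As a sketch of one such transformation (a log1p transform on a synthetic right-skewed feature; which transform suits each column is a judgment call, and left-skewed features would need a different treatment, such as reflecting the values before taking the log):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed feature, similar in spirit to speechiness or liveness
skewed = pd.Series(rng.exponential(scale=1.0, size=5000))

# log1p compresses the long right tail toward a more symmetric shape
transformed = np.log1p(skewed)

# Skewness should move closer to 0 after the transform
print(round(skewed.skew(), 2), round(transformed.skew(), 2))
```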
Visualize Relationships¶
Categorical-to-categorical relationships¶
track_popularity_bin Vs playlist_genre¶
sns.catplot( data = df, x='track_popularity_bin', hue='playlist_genre', kind='count' )
plt.show()
💡 The above DODGED BAR CHART of track_popularity_bin against playlist_genre shows that edm is the most unpopular genre. Other genres have approximately the same number of tracks across the unpopular and popular categories. We can also see that pop and rap are the most popular genres
playlist_subgenre Vs playlist_genre¶
Next, visualizing the relationship between playlist_subgenre and playlist_genre using HEATMAP
fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap( pd.crosstab( df.playlist_subgenre, df.playlist_genre ), ax = ax,
annot=True, annot_kws={'size': 10}, fmt='d',
cbar=False)
plt.show()
📌 The above heatmap shows that playlist_subgenre is strongly tied to playlist_genre: each subgenre maps to a single genre. Given this redundancy, the playlist_subgenre field can be dropped from the dataset, as highly redundant variables don't add much value to the final model.
df.drop(columns=['playlist_subgenre'], inplace=True)
Categorical to Continuous¶
Key Vs Track Popularity¶
sns.catplot( data = df, x='key', y='track_popularity', kind='point', linestyle='none')
plt.show()
💡 Tracks with higher popularity tend to have higher value for key
Mode Vs Track Popularity¶
sns.catplot( data = df, x='mode', y='track_popularity', kind='point', linestyle='none')
plt.show()
💡 Tracks with higher popularity tend to have 1 as the mode
Release Year Vs Track Popularity¶
sns.boxplot(data=df, x="release_year_bin", y="track_popularity", showmeans=True)
<Axes: xlabel='release_year_bin', ylabel='track_popularity'>
The above boxplot tells us that songs released recently have higher popularity than songs released in older years. This is a good indication that release_year_bin can have an impact on popularity.
Release Month Vs Track Popularity¶
sns.catplot(data=df, x="release_month", y="track_popularity", kind='point', aspect=2, linestyle='none')
<seaborn.axisgrid.FacetGrid at 0x7fd9edf9f6d0>
💡 Tracks that were released during the months of October, November and December tend to have higher popularity.
Continuous-to-Continuous Relationships¶
Corr plot for ALL variables¶
A correlation plot is one of the most effective ways to view the relationships between continuous variables
Below is the corr plot for the given dataset
fig, ax = plt.subplots(figsize=(20,15))
sns.heatmap(data = df.select_dtypes('number').corr(numeric_only=True),
vmin=-1, vmax=1, center=0,
cmap='coolwarm', cbar=False,
annot=True, annot_kws={'size': 12},
ax=ax)
plt.tight_layout()
plt.show()
💡 The above plot reveals the following
- energy and loudness are highly positively correlated
- energy and acousticness are highly negatively correlated
- track_popularity has a low correlation with all other variables. This is good because we can use the other variables to "predict" track popularity
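The strongest pairs can also be pulled out programmatically rather than read off the heatmap. A sketch on a small synthetic frame, engineered so that energy and loudness correlate positively and energy and acousticness negatively, mirroring the observations above (the real analysis would call this on df.select_dtypes('number')):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
energy = rng.uniform(size=1000)
# loudness built to track energy; acousticness built to oppose it
toy = pd.DataFrame({
    'energy': energy,
    'loudness': energy * 5 + rng.normal(scale=0.5, size=1000),
    'acousticness': 1 - energy + rng.normal(scale=0.2, size=1000),
})

# Unstack the correlation matrix and rank pairs by absolute correlation
pairs = toy.corr().abs().unstack().sort_values(ascending=False)
pairs = pairs[pairs < 1.0]  # drop the self-correlations on the diagonal
print(pairs.head(2))
```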
Next, visualizing the relationships between a few different pairs of continuous variables
energy Vs danceability¶
sns.relplot(data = df, x='energy', y='danceability')
plt.show()
💡Tracks with higher energy tend to be more danceable
acousticness Vs loudness¶
sns.relplot(data = df, x='acousticness', y='loudness')
plt.show()
💡 Tracks that are more acoustic tend to be quieter than tracks that are less acoustic
valence Vs danceability¶
sns.relplot(data = df, x='valence', y='danceability')
plt.show()
💡 Danceability of tracks increases as valence increases
Visualize conditional distributions of the continuous inputs GROUPED BY the response (outcome) unique values¶
Here, I visualize the conditional distributions of the continuous inputs grouped by the response variable track_popularity_bin
sns.displot(data = df_lf, x='value', col='variable', kind='kde',
hue='track_popularity_bin',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False,
col_wrap=3
)
plt.show()
💡 Observations
- danceability of popular songs (track_popularity_bin=1) is higher than that of unpopular songs (track_popularity_bin=0)
- acousticness of popular songs is lower than that of unpopular songs
- popular songs tend to have shorter durations compared to unpopular songs
Visualize conditional distributions of the continuous inputs GROUPED BY the response (outcome) variable and additional categorical variable¶
sns.catplot(data = df_lf, x='track_popularity_bin', y='value', col='variable',
hue='playlist_genre',
kind='box',
sharey=False,
showmeans=True,
col_wrap=3,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
💡 Observations
- The danceability score is higher for pop and rap tracks in the popular category compared to the unpopular category
- Tracks of the rock genre tend to be longer in duration than other genres
- loudness seems to be higher for all genres in the popular category compared to the unpopular category
- The distributions of the continuous variables vary for each genre. This suggests that the continuous variables by themselves are sufficient to determine the popularity of a track without relying on the genre
Visualize relationships between continuous inputs GROUPED BY the response (outcome) unique values¶
Here, leveraging SCATTER PLOT to visualize the relationships between continuous inputs GROUPED BY the response (outcome) unique values
tempo Vs valence GROUPED BY track_popularity_bin¶
sns.relplot(data=df, x='tempo', y='valence', hue='track_popularity_bin')
plt.show()
💡 Popular tracks (track_popularity_bin=1) tend to have higher tempo and valence compared to the unpopular tracks (track_popularity_bin=0)
tempo Vs danceability GROUPED BY track_popularity_bin¶
sns.relplot(data=df, x='tempo', y='danceability', hue='track_popularity_bin')
plt.show()
💡 Observations
- danceability increases as tempo increases
- Popular songs (track_popularity_bin=1) tend to have higher tempo and danceability compared to the unpopular tracks (track_popularity_bin=0)
- Given the tempo and danceability scores, one can estimate the popularity of a track
acousticness Vs loudness GROUPED BY track_popularity_bin¶
sns.relplot(data=df, x='acousticness', y='loudness', hue='track_popularity_bin')
plt.show()
💡 Tracks that are more acoustic tend to be quieter
sns.relplot(data=df, x='duration_ms', y='valence', hue='track_popularity_bin')
plt.show()
💡 Observations
- Popular tracks sound more positive (higher valence values) and tend to have shorter durations
- Unpopular tracks sound more negative (lower valence values) and tend to have longer durations
Next up, PAIR PLOT is used to visualize the relationships between the other continuous variables grouped by the target variable that weren't discussed above
sns.pairplot(data=df[[ 'energy', 'key', 'mode',
'speechiness', 'instrumentalness',
'liveness', 'track_popularity_bin']],
hue='track_popularity_bin',
diag_kws={'common_norm': False})
plt.show()
💡 Observations
- Tracks that have low speechiness tend to have high instrumentalness
- Some popular tracks have high speechiness as well as high energy. This could be because of tracks from the rap genre that are speechy as well as highly energetic
- Tracks that have high liveness tend to have high energy. This could be because high-energy songs are more likely to be performed in front of a live audience
- Popular songs that have a mode of 1 tend to have high key values
Visualize the counts of combinations between the response (outcome) and categorical inputs¶
sns.catplot(data = df, x='playlist_genre', hue='track_popularity_bin', col='release_year_bin', kind='count', )
plt.show()
💡 Observations
- rock used to be the most popular genre in the older years, and it has been replaced by pop and rap in the recent years
- edm is the most unpopular genre in the recent years
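The genre-by-popularity counts in the catplot can also be expressed as within-genre proportions via pd.crosstab with normalize='index'. A minimal sketch on a hypothetical miniature of the data (the genre/label rows below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the playlist data: genre and popularity bin only
toy = pd.DataFrame({
    'playlist_genre':       ['rock', 'rock', 'pop', 'pop', 'rap', 'edm', 'edm', 'edm'],
    'track_popularity_bin': [1,      0,      1,     1,     1,     0,     0,     1],
})

# Share of popular (1) vs unpopular (0) tracks WITHIN each genre;
# normalize='index' makes each row sum to 1
share = pd.crosstab(toy['playlist_genre'], toy['track_popularity_bin'],
                    normalize='index')
print(share)
```

On the real df this would be `pd.crosstab(df.playlist_genre, df.track_popularity_bin, normalize='index')`, which makes "edm is the most unpopular genre" directly readable off the 0-column.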
df.playlist_genre.value_counts(normalize=True)
K-Means Clustering¶
Clustering is an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other. For this exercise, we will use the K-Means method to cluster the given dataset
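Before applying K-Means to the tracks, its basic behavior can be sanity-checked on toy data: given well-separated groups, fit_predict assigns each group a single consistent label. A minimal sketch (the blob centers and spread are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs; K-Means should recover them exactly
X_toy = np.vstack([np.random.default_rng(1).normal(0, 0.3, (50, 2)),
                   np.random.default_rng(2).normal(5, 0.3, (50, 2))])

# fit_predict returns one integer cluster label per row
labels = KMeans(n_clusters=2, random_state=121, n_init=10).fit_predict(X_toy)
```

The label values themselves (0 vs 1) are arbitrary; only the grouping is meaningful, which is why the cluster columns below are compared against track_popularity_bin via crosstabs rather than matched by value.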
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
columns_to_use = ['danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms',
'track_popularity_bin', 'playlist_genre']
df_kmeans = df[columns_to_use]
Preprocessing¶
df_kmeans.isna().sum()
danceability            0
energy                  0
key                     0
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness        0
liveness                0
valence                 0
tempo                   0
duration_ms             0
track_popularity_bin    0
playlist_genre          0
dtype: int64
💡 There are no MISSING VALUES in the dataset
df_kmeans_features_clean = df_kmeans.select_dtypes('number').copy()
sns.catplot(data = df_kmeans_features_clean, kind='box', aspect=2)
plt.show()
💡 Since one variable dominates the others in magnitude, the data has to be standardized first to remove the MAGNITUDE and SCALE effect. K-Means considers SIMILAR to be based on DISTANCE, and distance depends on MAGNITUDE and SCALE
# Using sklearn StandardScaler to standardize the dataset
X = StandardScaler().fit_transform(df_kmeans_features_clean)
sns.catplot(data = pd.DataFrame(X, columns=df_kmeans_features_clean.columns), kind='box', aspect=2)
plt.show()
📌 Variables have now been standardized
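The claim that standardization removes the magnitude/scale effect can be verified numerically: after StandardScaler, every column has mean 0 and standard deviation 1. A minimal sketch on synthetic columns with mismatched scales (the value ranges are arbitrary stand-ins for a 0-1 audio score and duration_ms):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on wildly different scales, like danceability vs duration_ms
rng = np.random.default_rng(0)
raw = np.column_stack([rng.uniform(0, 1, 100),        # 0-1 score
                       rng.uniform(1e5, 3e5, 100)])   # milliseconds

# Standardize: subtract the column mean, divide by the column std
Z = StandardScaler().fit_transform(raw)
```

After this, Euclidean distances weight both columns equally, which is exactly what K-Means needs.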
Clustering¶
Starting with two clusters¶
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict(X)
df_kmeans_clean_copy = df_kmeans.copy()
df_kmeans_clean_copy['k2'] = pd.Series(clusters_2, index=df_kmeans_clean_copy.index ).astype('category')
df_kmeans_clean_copy.k2.value_counts()
k2
0    20078
1     8278
Name: count, dtype: int64
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(data = pd.crosstab(df_kmeans_clean_copy.track_popularity_bin,
df_kmeans_clean_copy.k2,
margins=True ),
annot=True,
annot_kws={"fontsize": 20},
fmt='g',
cbar=False,
ax=ax)
plt.show()
The above heatmap tells us that most songs ended up in cluster 0 regardless of popularity. This suggests that more clusters are needed to isolate groups of songs whose characteristics distinguish the popular category from the unpopular one
Finding Optimal number of clusters¶
We will find the optimal number of clusters using the KNEE BEND (elbow) PLOT
tots_within = []
K = range(1, 15)
for k in K:
km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
km = km.fit(X)
tots_within.append( km.inertia_ )
fig, ax = plt.subplots()
ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')
plt.show()
📌 Although there isn't a clean KNEE BEND here, we can see that the plot starts to bend around a cluster value of 5, and the bend is prominent before 8. So, I have decided to go with 7 clusters for further analysis
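When the knee bend is ambiguous, the silhouette score is a common complementary check: it rewards tight, well-separated clusters and peaks at a good k. This is not part of the analysis above, just an alternative technique, sketched on synthetic blobs with a known answer of 3 (centers and spread are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs; the silhouette should peak at k=3
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=121, n_init=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)   # in [-1, 1], higher is better
best_k = max(scores, key=scores.get)
```

On the standardized track matrix X, running the same loop over k = 2..14 would give a second opinion on the choice of 7.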
clusters_7 = KMeans(n_clusters=7, random_state=121, n_init=25, max_iter=500).fit_predict(X)
df_kmeans_clean_copy = df_kmeans.copy()
df_kmeans_clean_copy['k7'] = pd.Series(clusters_7, index=df_kmeans_clean_copy.index ).astype('category')
df_kmeans_clean_copy.k7.value_counts()
k7
3    6844
4    5963
2    5038
6    3510
5    3123
0    2211
1    1667
Name: count, dtype: int64
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(data = pd.crosstab(df_kmeans_clean_copy.track_popularity_bin,
df_kmeans_clean_copy.k7,
margins=True ),
annot=True,
annot_kws={"fontsize": 20},
fmt='g',
cbar=False,
ax=ax)
plt.show()
💡 The above heatmap tells us that clusters 0, 1, and 2 tend to have more unpopular songs than clusters 3, 4, 5, and 6. This is a much better separation than with two clusters
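The heatmap comparison can be made precise by computing the fraction of popular tracks within each cluster. A minimal sketch on a hypothetical miniature of the clustered data (the cluster labels and bins below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature: cluster label and popularity bin per track
toy = pd.DataFrame({
    'k7':                   [0, 0, 0, 1, 1, 3, 3, 3, 3],
    'track_popularity_bin': [0, 0, 1, 0, 0, 1, 1, 1, 0],
})

# Since the bin is 0/1, the group mean IS the share of popular tracks
popular_share = toy.groupby('k7')['track_popularity_bin'].mean()
print(popular_share)
```

On the real data this would be `df_kmeans_clean_copy.groupby('k7')['track_popularity_bin'].mean()`, giving one popularity rate per cluster to rank them by.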
Visualizing relationships and conditional distributions using PAIR PLOT¶
Finally, using PAIR PLOT to visualize the relationships between continuous variables GROUPED BY the cluster category as well as conditional distribution of each continuous variable GROUPED BY the cluster category
# NOTE: I have used a sample of 5000 because my notebook kept crashing
sns.pairplot(data = df_kmeans_clean_copy.sample(5000), hue='k7', diag_kws={'common_norm': False},
palette='viridis')
plt.show()
💡 Observations
- Tracks in clusters 0, 1, and 2 tend to have higher energy compared to clusters 3, 4, 5, and 6
- Tracks in cluster 5 have lower energy and a duration_ms between 200000 and 300000
- Cluster 0 seems to be made of tracks with high instrumentalness and high energy
More observations will be documented during the second part of the assignment